See inside your ChromaDB

If you are using ChromaDB with RAG or any other use case, debugging can be a bit…on the frustrating side. Back when we were just starting to build out Ollama, I was building a simple RAG tool for one of the examples. I was just starting to work with Chroma, and before I understood how it worked, I hit an issue. I wasn’t sure if the problem was with my query or with getting the data into Chroma. I had no visibility into what Chroma was doing. So I used the API to try to pull info out, but I wasn’t seeing anything. Again, I wasn’t sure if the problem was my code, my query, or maybe there just wasn’t any data in the database.

What I needed was a way to see inside ChromaDB and find where the problem was. There are a few stages we can go through to find out what is going on. First, we can look at the base structure ChromaDB uses, and that’s SQLite, which is pretty interesting. Using SQLite means we can run the system with a minimal number of files. And depending on the use case and on how the database was architected, accessing data in SQLite can be faster than accessing the same data in the filesystem.

There are a large number of tools for working with SQLite. SQLPro and Base come to mind as really easy ones. But seeing whether your data is there requires a bit of spelunking and, even better, some knowledge of SQL and of the structure of the ChromaDB database. Still, I can open up one of those tools, point it at the SQLite database used by Chroma, and start poking around. After a few tries I can find the data that corresponds to what I had added to the database.
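If you would rather do that first bit of spelunking from a script than from a GUI tool, here is a minimal sketch using Python's built-in sqlite3 module. It does not assume anything about Chroma's table layout; it just asks SQLite which tables exist and how many rows each one holds. The function name table_row_counts is made up for this example, and the file opens read-only so that looking around can't change anything.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for each table in a SQLite file.

    A count of None means the table could not be read, for example a
    virtual table whose SQLite extension is missing from this build.
    """
    # Open read-only so that poking around cannot modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # sqlite_master is SQLite's own catalog of tables and indexes.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier in case a table name has odd characters.
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                counts[name] = conn.execute(
                    f"SELECT COUNT(*) FROM {quoted}"
                ).fetchone()[0]
            except sqlite3.Error:
                counts[name] = None
        return counts
    finally:
        conn.close()
```

To try it, call table_row_counts with the path to the SQLite file inside your Chroma data directory. In the versions I am aware of that file is named chroma.sqlite3, but treat that name as an assumption and check your own directory. If every table reports zero rows, the problem is probably on the ingest side rather than in your query.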

But what if you want to go further? Well, another step you can take is to use the REST API straight from the browser, though there is only so much you can do there with a few clicks. If you started Chroma locally on port 8000, you can probably just go to that URL and tack /docs onto the end, and you see the API docs. For most of these endpoints you can click the “try it out” button. So I can go to the “list collections” endpoint and click “try it out”. Don’t enter anything, click execute, and I get a list of collections.
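The same “list collections” call can be made from a script, which is handy when you want the collection IDs without clicking through the docs page. This is only a sketch under a few assumptions: the server is on localhost port 8000, the endpoint lives under /api/v1 (newer Chroma releases use a different prefix, so check your own /docs page), and the reply is a JSON list of objects that each have a name and an id. The helper names are invented for this example.

```python
import json
import urllib.request


def collections_url(base_url="http://localhost:8000", api_prefix="/api/v1"):
    """Build the list-collections URL. The /api/v1 prefix is an assumption."""
    return base_url.rstrip("/") + "/" + api_prefix.strip("/") + "/collections"


def list_collections(base_url="http://localhost:8000", api_prefix="/api/v1"):
    """GET the collections endpoint and return the parsed JSON reply."""
    url = collections_url(base_url, api_prefix)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def name_to_id(collections):
    """Map collection name -> id, assuming each entry has both keys."""
    return {c["name"]: c["id"] for c in collections}
```

With a server running, name_to_id(list_collections()) should give you the IDs you need for the other endpoints. An empty result here already tells you something: nothing was ever written to this database.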

So that’s good, but can I see if the embeddings are there? Well, not super easily. Go down to the query endpoint and specify the ID of the collection, which I got from the previous endpoint. If you execute it now, you get an error about embedding dimensions. That’s because it’s treating the placeholder word “string” as an embedding array and doesn’t know what to do. We need a real embedding in there. To deal with this I created a simple script on my Mac that embeds some text and copies the result to the clipboard. I did this on a Mac because that is what I am using. But hey, we have AI, so I used one of the programming models to translate it to something that would work on Windows as well as on Linux. Those are in the repo technovangelist/videoprojects. I can’t verify that they work because I didn’t try them on those other platforms. Now I can run ‘nomicembed what is the vision pro’, paste the result into the query, and execute. I get a response and can see the most related entries. OK, but that required a bit of work. Is there a tool I can use to get the info more easily?
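The embed-then-paste step can also be done in one go from Python. This is not the nomicembed script from the repo, just a rough sketch of the same idea, and it leans on several assumptions: Ollama is running on its default port 11434 and exposes /api/embeddings, the embedding model is named nomic-embed-text, Chroma is on port 8000 with its query endpoint under /api/v1, and the query body uses the query_embeddings and n_results fields. All the function names are made up for the example.

```python
import json
import urllib.request


def post_json(url, payload):
    """POST a JSON body and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def embed_text(text, model="nomic-embed-text",
               ollama_url="http://localhost:11434"):
    """Ask Ollama for an embedding. Model name and endpoint are assumptions."""
    reply = post_json(ollama_url + "/api/embeddings",
                      {"model": model, "prompt": text})
    return reply["embedding"]


def build_query_body(embedding, n_results=5):
    """Build the query body: a list of embeddings, here holding just one."""
    return {
        "query_embeddings": [list(embedding)],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"],
    }


def query_collection(collection_id, text,
                     chroma_url="http://localhost:8000",
                     api_prefix="/api/v1"):
    """Embed the text, then query one collection for its nearest entries."""
    url = f"{chroma_url}{api_prefix}/collections/{collection_id}/query"
    return post_json(url, build_query_body(embed_text(text)))
```

Something like query_collection(collection_id, "what is the vision pro") should then return the most related entries. The embedding has to come from the same model that was used when the data was added. Otherwise the vector lengths won't match and you are back to the embedding dimensions error.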

The best tool I found…OK, the only reliable tool I found was something called Chroma DB Admin. And that is one of the problems. There isn’t a great tool for working with vector DBs in general. I started working with Milvus and there is a tool for that. I am sure there are tools for other databases as well, but no universal tool exists. OK, that’s not entirely true. I did stumble on one that promises to become a universal tool, but for now it has lots of issues. I had it working for a second and then it failed. I’ll tell you more about it because it has a lot of promise if they can fix a few things. But other than that, there isn’t ‘one tool to manage them all’…because there is no standard way of working with vectors. Everyone is doing it a bit differently.

OK, so Chroma DB Admin. You can find this on GitHub at flanker/chromadb-admin. Just clone it locally, then run the appropriate tool for you. I like using Bun, so I’ll do “bun install”, and then “bun dev”. Now I can open up the web interface at the URL shown.

You might have to point it to your database and then a collection, but pretty quickly you see all of the documents in your vector DB. And you see all the components of each doc. So here is an entry, and I can see the source text, the embedding, and the metadata.
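If you want that same per-document view in a script, the main wrinkle is the shape of the data. As far as I know, Chroma's get call returns parallel lists (ids, documents, metadatas, embeddings) rather than one record per document, and any list you did not ask for comes back empty or missing. Treat that shape as an assumption. Here is a small sketch that stitches the lists back into one row per document; rows_from_get_response is a name invented for this example.

```python
def rows_from_get_response(response, preview_dims=3):
    """Turn parallel lists from a get-style reply into one dict per document.

    Assumes the reply has an "ids" list plus optional "documents",
    "metadatas" and "embeddings" lists of the same length.
    """
    ids = response.get("ids") or []

    def column(key):
        # A missing or null column becomes a column of None values.
        values = response.get(key)
        if values is None:
            return [None] * len(ids)
        return values

    rows = []
    for doc_id, text, meta, emb in zip(
        ids, column("documents"), column("metadatas"), column("embeddings")
    ):
        rows.append({
            "id": doc_id,
            "document": text,
            "metadata": meta,
            # Show the vector length and only the first few numbers.
            "dimensions": len(emb) if emb is not None else None,
            "embedding_preview": (
                list(emb[:preview_dims]) if emb is not None else None
            ),
        })
    return rows
```

The dimensions value is the useful part for debugging. If it comes back as None, the embeddings were not included in the reply or were never stored. If it is a number, it tells you what length your query embedding needs to be.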

I can’t change anything here, but I can find other documents that are similar to this one. Cool.

Now, I mentioned there was another tool. If videos only had a lifetime of a few days or weeks, I might not even mention this, but they are around for longer, so by the time you see this, it may have become the most awesome tool ever and this will be an interesting historical record. For now, though, Vector Admin is interesting but awkward. Hopefully it will soon resolve some issues and become incredible. The idea is to have a tool that manages all your vector databases, whether they are Chroma or Pinecone or a few others.

My issues are that the install currently takes ages, and then it requires this weird overhead of organizations and workspaces without any real benefit. But you create one of each and then you can add your databases. It supports Chroma, but it doesn’t seem to support local databases. It will work with Chroma in Docker, just not if you run the chroma command locally. OK, so I spun up another database in Docker and it worked fine. But then I tried to add the local one again, and it overwrote the Docker one and now I had nothing. So I tried removing the workspace and recreating it, but that didn’t help. So I removed the org and was booted out to create a new org, but to do so I had to create a new user, which I couldn’t do because I had already run the onboarding. So I was stuck and had to reinstall.

I would think this could be solved by hiding the whole orgs and workspaces stuff until the benefit of that overhead shows up, and then making it a bit easier to add multiple databases. I was really excited about the opportunity to sync vector databases, so I hope this starts to work better in the future. There are also a lot of features around editing documents in the database, but that requires recreating the embeds, and the only option for that is to use OpenAI. If you had used Ollama to create the embeds with nomic embed, that isn’t going to help much. These are all just bits of simple polish that are needed, as the rest of the product is pretty great. And I am super excited to see where he goes from here.

I understand the motivation for the org and workspace overhead. It feels like an enterprise kind of feature, and those enterprises, well, they are the ones likely to pay for such a feature. I just hope it all gets tightened up a bit.

So those are the tools I know about to work with Chroma DB. Did you know about any of those? Do you know of any others? Do you think this type of tool is useful when working with Chroma? I would love to hear what you think in the comments below.

Thanks so much for being here. Goodbye.